Data Science in Environmental Science and Oceanography with R and Python
Course Notes
1 About
This guide was created for the course “Data Science in Environmental Science and Oceanography with R and Python” at the Informatica Feminale 2025.
2 Agenda
2.1 Agenda - Day 1: PANGAEA and R & Python basics
| Aug 20th | Topic | Presenter |
|---|---|---|
| 10:00 | Round of introductions (20 min) | all including participants |
| 10:00 | Intro PANGAEA, data repos (20 min) | DR |
| 10:20 | How to find and use data from PANGAEA (15 mins) | KRC |
| 10:35 | Quiz (10 min) | KRC |
| 10:45 | How to download datasets (10 min) | KRC |
| 10:55 | Questions I (10 min) | KRC & DR |
| 11:05 | break | |
| 11:15 | Intro to PANGAEA packages and tokens (20-25 min) | DR |
| 11:35 | Questions II (10min) | DR & KRC |
| 11:45 | Lunch break | |
| 14:00 | Basics of R and Python | DS |
| 16:30 | End of day 1 | |
2.2 Agenda - Day 2: Getting data from PANGAEA and Data Cleaning
| Aug 21st | Topic | Presenter |
|---|---|---|
| 09:00 | Downloading and manipulating PANGAEA data in R - in brief | DR |
| 09:45 | Minibreak | |
| 09:50 | Downloading and manipulating PANGAEA data in Python (100 min including bio break after 45 min) | KRC |
| 12:30 | Lunch break | |
| 14:00 | Data Cleaning in Python | KRC |
| 15:40 | Data Cleaning in R | DS |
| 16:30 | End of day 2 | |
2.3 Agenda - Day 3: Plotting and Statistics
| Aug 22nd | Topic | Presenter |
|---|---|---|
| 09:00 | Intro and Cleaning of Penguin data set, Plotting & Statistics | DS & MR |
| 12:30 | Lunch break | |
| 14:00 | Markdown & time for open questions | all |
| 16:30 | End of day 3 | |
3 Before the course
3.1 Installation Notes
If you haven’t installed R and RStudio yet, use the following instructions for the installation:
- Install R:
Installation on Windows and Mac: https://www.dataquest.io/blog/installing-r-on-your-computer/
For Linux: https://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed-_0028Unix_002dlike_0029
Install the free RStudio Desktop Version: https://www.rstudio.com/products/rstudio/#Desktop
Do you know how you can install new R packages? You can read about it here: https://thedatacommunity.org/2017/09/17/installing-loading-unloading-and-removing-r-packages-in-rstudio/
In this script, we make use of the following packages:
- pangaear
- dplyr
- tidyr
- stringr
- lubridate
- httr
- worrms
- knitr
- magrittr
- car
- palmerpenguins
- reshape
- ggplot2
- (reticulate)
Many integrated development environments (IDEs) exist for Python. For this beginners course, we recommend the free and open source Python IDE called Thonny.
Install Python:
Thonny is available for Windows, Mac and Linux. Just download the suitable executable installation file for your computer and start the installation. The basic setup of Python is then done. When you open Thonny for the first time, you will see the default window layout with a code editor on the upper left side, a shell on the lower left side and an assistant on the right hand side.
Install additional Python packages:
During our course, we will use several additional packages. All can be installed either via the package installer pip or via Thonny’s package manager.
Example: write the following command in the shell to install pangaeapy:
!pip install pangaeapy
Alternatively, click on Tools in the upper left hand corner of Thonny and select Manage packages…. Search for jupyter on PyPI and click on the jupyter search result to start the installation.
The following packages are needed for this course (re, os, datetime, difflib and collections are part of the Python standard library and need no installation):
- jupyter
- pangaeapy
- numpy
- pandas
- re
- os
- datetime
- ipdb
- requests
- LatLon23
- difflib
- collections
- matplotlib
- plotly
3.2 Programming Environments
For R, we recommend RStudio.
3.2.0.1 Creating a new project
Many statistics classes teach you to set a working directory:
setwd("~/Documents/")
It can be helpful to create a project instead. All files within the project’s directory will automatically have the correct working directory. This way you can create new files and start coding easily.
- Click File -> New Project… Even when you choose “Existing Directory” you will still be able to create a new directory
- Enter the directory path straight away, or click on browse.
- Choose an existing directory or create a new one
3.2.0.2 RStudio screen
- In the upper left hand corner you have the code editor. This is where you can type your scripts.
- In the bottom left hand corner you have the console. Here you will see the results of all commands you are running.
- In the upper right hand corner you can see all objects you created, e.g. data frames, functions, strings, …
- In the bottom right hand corner you will be able to see the files in your project, your plots and more information, e.g. about functions
You can use “Ctrl” followed by a number to jump to a specific window. If you want to get into the terminal, use “Shift+Alt+t” or “Shift+Alt+m”. You can find an overview of all shortcuts using “Shift+Alt+k”.
3.2.0.3 R Workflow
- You type your scripts in the code editor, so that you can save them.
- If you want to run the whole script, you can click on ‘Source’ at the top of the code editor or press “Ctrl + Alt + R”. (Shortcuts might differ depending on your system.)
- Use “Ctrl + Alt + B/E” to run from the beginning to the current line or from the current line to the end.
- If you want to run a single line, you can click on ‘Run’ or press “Control + Enter”
- If you want to run a few lines, highlight the lines you want to run and use ‘Run’/ “Control + Enter”
- If you want to use Autocomplete, you can press “Control + Space”
- If you want to read about an existing method, you can type “?<Name of the method>” to get to the documentation
3.2.0.4 Save your code in a script or notebook
Python code is saved either in scripts with a .py ending or in so-called Jupyter Notebooks with an .ipynb ending. There is no difference in the Python code itself, only in the way the code is executed.
Files with a .py ending need to be executed in a shell or terminal, i.e. somewhere you can type the command to execute your script. If you want to know more about the differences between a shell and a terminal, have a look at GeeksforGeeks.
Files with .ipynb ending consist of individual cells which can be executed directly in the Jupyter Notebook.
Note: If you open a Jupyter Notebook in an editor, you will see that the Python code is mixed with additional code to interpret each notebook cell. You can convert .py files to .ipynb files and vice versa using a command line tool in the shell as described in the nbconvert documentation.
In this course we will work with Jupyter Notebook. To open a Jupyter environment write in the shell window of Thonny:
!jupyter notebook
After pressing enter, a browser window will open to show you your Home directory. Double click on your Jupyter Notebook of choice to open it or open a new one with the File button, select new and Notebook. Select the suggested Kernel and start scripting.
Type anything you want, e.g. print(‘Hello World’), into the cell and execute it by clicking on the button with the single triangle.
3.3 Data types
3.3.0.1 Character
firstname <- "Pipi"
surname <- "Langstrumpf"
name <- paste(firstname, surname)
name
[1] "Pipi Langstrumpf"
3.3.0.2 Numeric
a <- 1
b <- 2
c <- a + b
c
[1] 3
d <- b*c/a
d
[1] 6
e <- d^2
e
[1] 36
f <- as.integer(3.14)
f
[1] 3
3.3.0.3 Logical
a <- TRUE
b <- FALSE
# or
a|b
[1] TRUE
# and
a&b
[1] FALSE
# not
!a
[1] FALSE
4 == 5
[1] FALSE
4 < 5
[1] TRUE
4 >= 4
[1] TRUE
is.na
function (x) .Primitive("is.na")
as.numeric(a)
[1] 1
3.3.0.4 && and ||
c(T,T,T) & c(T,F,F)
[1]  TRUE FALSE FALSE
c(T,T,T) && c(T,F,F)
Error in c(T, T, T) && c(T, F, F): 'length = 3' in coercion to 'logical(1)'
IDontExist
Error: object 'IDontExist' not found
T | IDontExist
Error: object 'IDontExist' not found
T || IDontExist
[1] TRUE
3.3.0.5 Vectors
# a simple numeric vector
myvector <- c(1,4,6)
myvector
[1] 1 4 6
# a vector of type character
myfamily <- c("Paula", "Ali", "Emma")
myfamily
[1] "Paula" "Ali"   "Emma"
# get the first element of a vector
myfamily[1]
[1] "Paula"
# apply function to all items:
myvector + 1
[1] 2 5 7
paste(myfamily, "Meier")
[1] "Paula Meier" "Ali Meier"   "Emma Meier"
# concatenate vectors:
longervector <- c(myvector, 5:7)
longervector
[1] 1 4 6 5 6 7
# create a sequence
odd <- seq(from = 1, to = 10, by = 2)
odd
[1] 1 3 5 7 9
# create a boring sequence
boring <- rep(1, 10)
boring
 [1] 1 1 1 1 1 1 1 1 1 1
3.3.0.6 Factors
fac <- factor(c("good", "better", "best"))
levels(fac)
[1] "best"   "better" "good"
nlevels(fac)
[1] 3
3.3.0.7 List of lists
You can create a list of lists (a list of vectors):
myList <- list(smallLetters=letters[1:10],
capitalLetters=LETTERS[1:10],
numbers=1:5)
myList
$smallLetters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$capitalLetters
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
$numbers
[1] 1 2 3 4 5
If you want to select sublists, you can do it with single brackets; the result is again a list.
# Accessing multiple elements
myList[1:2]
$smallLetters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

$capitalLetters
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
If you want to choose single elements, you need double brackets:
# Accessing single elements
myList[2][2]
$<NA>
NULL
myList[[2]][2]
[1] "B"
myList[[1:2]]
[1] "b"
3.3.0.8 Matrices
m <- matrix(data=1:12, nrow = 3, ncol = 4)
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Element wise operations:
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Element-wise operations
m * 0.5
     [,1] [,2] [,3] [,4]
[1,]  0.5  2.0  3.5  5.0
[2,]  1.0  2.5  4.0  5.5
[3,]  1.5  3.0  4.5  6.0
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Element-wise operations
m * c(1,2,3)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    4   10   16   22
[3,]    9   18   27   36
Matrix/vector multiplication:
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Matrix multiplication
m %*% c(1,2,3,4)
     [,1]
[1,]   70
[2,]   80
[3,]   90
Good explanation of Matrix multiplication: Eli Bendersky’s website
3.3.0.9 Data frames
# Creating a data frame
family.frame <- data.frame(index = 1:3, firstname = myfamily, surname = rep("Meier", 3))
library(knitr)
kable(family.frame)
| index | firstname | surname |
|---|---|---|
| 1 | Paula | Meier |
| 2 | Ali | Meier |
| 3 | Emma | Meier |
A comprehensive list of data types is given in the Python documentation.
3.3.0.10 Character aka String
firstname = 'Pipi'
surname = 'Langstrumpf'
name = firstname+surname
print(name)
PipiLangstrumpf
name = firstname+' '+surname
print(name)
Pipi Langstrumpf
3.3.0.11 Numeric
import numpy as np
a = 1
b = 2
c = a + b
print(c)
3
d = b*c/a
print(d)
6.0
e = d ** 2
print(e)
36.0
f = int(3.14)
print(f)
3
g = float(3.14)
print(g)
3.14
h = np.nan
float(h)
nan
3.3.0.12 Logical
a = True
b = False
# or
a or b
True
# and
a and b
False
# not
not a
False
4 == 5
False
4 < 5
True
4 >= 4
True
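In R, `&` and `|` work element-wise on whole vectors, while plain Python `and`/`or` only handle single values. The element-wise counterpart in Python lives in numpy, where `&` and `|` operate on boolean arrays. A minimal sketch (the example arrays are just illustrative):

```python
import numpy as np

a = np.array([True, True, True])
b = np.array([True, False, False])

# element-wise "and" / "or" on boolean arrays uses & and |
print(a & b)   # [ True False False]
print(a | b)   # [ True  True  True]
```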
3.3.0.13 Vectors aka list
# a simple numeric vector is similar to a list in python
myvector = [1,4,6]
print(myvector)
[1, 4, 6]
# a vector of type character
myfamily = ["Paula", "Ali", "Emma"]
print(myfamily)
['Paula', 'Ali', 'Emma']
# get the first element of a vector
myfamily[0]
'Paula'
# append another item:
myfamily.append("Meier")
print(myfamily)
['Paula', 'Ali', 'Emma', 'Meier']
# create a sequence
odd = list(range(1, 10, 2)) # start, stop, step
print(odd)
[1, 3, 5, 7, 9]
# create a boring sequence
boring = list(range(10))
print(boring)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# concatenate vectors:
longervector = myvector + odd
print(longervector)
[1, 4, 6, 1, 3, 5, 7, 9]
3.3.0.14 Matrices aka Array
import numpy as np
m = np.arange(1, 13).reshape(3, 4)
print(m)
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
Element wise operations:
# Element-wise operations
result = m * 0.5
print(result)
[[0.5 1.  1.5 2. ]
 [2.5 3.  3.5 4. ]
 [4.5 5.  5.5 6. ]]
m = np.arange(1, 13).reshape(3, 4)
n = np.array([1, 2, 3, 4])
# Element-wise operations
result = m * n
print(result)
[[ 1  4  9 16]
 [ 5 12 21 32]
 [ 9 20 33 48]]
Matrix/vector multiplication:
m = np.arange(1, 13).reshape(3, 4)
v = np.array([1, 2, 3, 4])
# Matrix multiplication
result = np.dot(m, v)
print(result)
[ 30  70 110]
# or:
result = np.matmul(m, v)
print(result)
[ 30  70 110]
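Python also has a dedicated matrix-multiplication operator, `@`, which is equivalent to np.matmul for arrays like these:

```python
import numpy as np

m = np.arange(1, 13).reshape(3, 4)
v = np.array([1, 2, 3, 4])

# the @ operator performs matrix multiplication
result = m @ v
print(result)  # [ 30  70 110]
```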
Good explanation of Matrix multiplication: Eli Bendersky’s website
3.3.0.15 Dataframes
Dataframes in Python are provided by the pandas package.
import pandas as pd
# Creating a data frame from a list
myfamily = ['Paula', 'Ali', 'Emma']
lastname = ['Meier']*3
family_frame = pd.DataFrame(list(zip(myfamily,lastname)),columns=['firstname','surname'])
family_frame
  firstname surname
0     Paula   Meier
1       Ali   Meier
2      Emma   Meier
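The same data frame can also be built directly from a dictionary that maps column names to lists, which is often easier to read than the zip construction:

```python
import pandas as pd

# construct the data frame column by column from a dict
family_frame = pd.DataFrame({
    'firstname': ['Paula', 'Ali', 'Emma'],
    'surname': ['Meier'] * 3,
})
print(family_frame)
```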
4 PANGAEA - Introduction
The slides of the PANGAEA theory part are stored on GitHub.
5 Basics R and Python
5.1 Preparing your data for analysis
5.1.0.1 Reading the data
Usually, you would use the “csv” format to read data into R. Sometimes you have other formats, e.g. “txt”. Ideally, you have exactly one line for the header. Make sure to choose the correct separator. You also have to tell R whether your file has a header or not.
df <- read.table("beedata.csv")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 2 elements
That did not work. Oh, yes, the separator!
df <- read.table("beedata.csv", sep = ",")
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 |
|---|---|---|---|---|---|---|---|---|---|
| time | hive | h | t_i_1 | t_i_2 | t_i_3 | t_i_4 | t_i_5 | t_o | weight_kg |
| 1.565081418e+18 | 4 | NA | 27.25 | 26.375 | 23.4375 | 23.5 | 23.25 | 21.625 | NA |
| 1.565081428e+18 | 4 | NA | 27.25 | 26.4375 | 23.5 | 23.5 | 23.25 | 21.625 | NA |
| 1.565081431e+18 | 13 | 59.0554 | 22.875 | 22.1875 | 23 | 21.8125 | 22.6875 | 22.875 | 10.779 |
| 1.565081438e+18 | 4 | NA | 27.25 | 26.375 | 23.5 | 23.5625 | 23.25 | 21.625 | NA |
Looks ok, but we forgot the heading:
df <- read.table("beedata.csv", sep = ",", header = T)
| time | hive | h | t_i_1 | t_i_2 | t_i_3 | t_i_4 | t_i_5 | t_o | weight_kg |
|---|---|---|---|---|---|---|---|---|---|
| 1.565081e+18 | 4 | NA | 27.250 | 26.3750 | 23.4375 | 23.5000 | 23.2500 | 21.625 | NA |
| 1.565081e+18 | 4 | NA | 27.250 | 26.4375 | 23.5000 | 23.5000 | 23.2500 | 21.625 | NA |
| 1.565081e+18 | 13 | 59.0554 | 22.875 | 22.1875 | 23.0000 | 21.8125 | 22.6875 | 22.875 | 10.779 |
| 1.565081e+18 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA |
| 1.565081e+18 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA |
That looks good!
Alternatively, we can use “read.csv”. In this case, the default parameters work for us.
df <- read.csv("beedata.csv")
Take care, sometimes you don’t get an error, but your data is screwed up anyway:
df <- read.csv("sampleData.csv", sep = " ")
| name. | age. | subject. | year |
|---|---|---|---|
| Li; | 18; | chemistry; | 1 |
| Svenson,Jacob; | 25; | psychology; | 12 |
| Raphaela; | 23; | psychology; | 1 |
Use the head function to inspect your data.
5.1.0.2 Transforming entire columns
The columns/ variables of a dataframe basically behave like entries in a named list. Each element of this list is a vector of a specific type. You can inspect the structure of a dataframe using the function “str”.
df <- read.csv("beedata.csv")
str(df)
'data.frame': 171299 obs. of 10 variables:
$ time : num 1.57e+18 1.57e+18 1.57e+18 1.57e+18 1.57e+18 ...
$ hive : int 4 4 13 4 4 4 4 4 4 4 ...
$ h : num NA NA 59.1 NA NA ...
$ t_i_1 : num 27.2 27.2 22.9 27.2 27.2 ...
$ t_i_2 : num 26.4 26.4 22.2 26.4 26.4 ...
$ t_i_3 : num 23.4 23.5 23 23.5 23.5 ...
$ t_i_4 : num 23.5 23.5 21.8 23.6 23.6 ...
$ t_i_5 : num 23.2 23.2 22.7 23.2 23.2 ...
$ t_o : num 21.6 21.6 22.9 21.6 21.6 ...
$ weight_kg: num NA NA 10.8 NA NA ...
E.g. converting kg to g:
df$weight_g <- df$weight_kg*1000
Marking high temperature values:
df$highTemp <- df$t_i_1>25
| time | hive | h | t_i_1 | weight_kg | weight_g | highTemp |
|---|---|---|---|---|---|---|
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
| 1.565081e+18 | 13 | 59.0554 | 22.875 | 10.779 | 10779 | FALSE |
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
| 1.565081e+18 | 4 | NA | 27.250 | NA | NA | TRUE |
Dealing with the timestamp (nanoseconds to seconds)
df$time <- as.POSIXct(df$time/1000000000, origin="1970-01-01")
| time | hive | h | t_i_1 | t_i_2 | t_i_3 | t_i_4 | t_i_5 | t_o | weight_kg | weight_g | highTemp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA | 27.250 | 26.3750 | 23.4375 | 23.5000 | 23.2500 | 21.625 | NA | NA | TRUE |
| 2019-08-06 10:50:28 | 4 | NA | 27.250 | 26.4375 | 23.5000 | 23.5000 | 23.2500 | 21.625 | NA | NA | TRUE |
| 2019-08-06 10:50:31 | 13 | 59.0554 | 22.875 | 22.1875 | 23.0000 | 21.8125 | 22.6875 | 22.875 | 10.779 | 10779 | FALSE |
| 2019-08-06 10:50:38 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA | NA | TRUE |
| 2019-08-06 10:50:48 | 4 | NA | 27.250 | 26.3750 | 23.5000 | 23.5625 | 23.2500 | 21.625 | NA | NA | TRUE |
5.1.0.3 Reading data
One easy way to read ASCII data in Python is via pandas. Many different formats can be read, e.g. “csv”, “txt” or “xlsx”. Ideally, you have exactly one line for the header. Data formats can be very diverse, and there are correspondingly many options to set before reading the data. Make sure to choose the correct separator.
# import package
import pandas as pd
# open the data as dataframe
df = pd.read_csv('beedata.csv')
# use head function to see the first 5 rows of the dataframe
df.head()
       time  hive        h   t_i_1  ...    t_i_4    t_i_5     t_o  weight_kg
0 1.565081e+18 4 NaN 27.250 ... 23.5000 23.2500 21.625 NaN
1 1.565081e+18 4 NaN 27.250 ... 23.5000 23.2500 21.625 NaN
2 1.565081e+18 13 59.0554 22.875 ... 21.8125 22.6875 22.875 10.779
3 1.565081e+18 4 NaN 27.250 ... 23.5625 23.2500 21.625 NaN
4 1.565081e+18 4 NaN 27.250 ... 23.5625 23.2500 21.625 NaN
[5 rows x 10 columns]
# use describe function to get an overview of the values of the dataframe
df.describe()
               time           hive  ...            t_o      weight_kg
count 1.712990e+05 171299.000000 ... 155142.000000 111106.000000
mean 1.565408e+18 8.382956 ... 18.709573 28.435138
std 1.668141e+14 4.523060 ... 31.252958 10.019183
min 1.565081e+18 3.000000 ... -0.062500 -59.093070
25% 1.565285e+18 4.000000 ... 14.625000 24.154370
50% 1.565363e+18 12.000000 ... 18.250000 24.832320
75% 1.565592e+18 12.000000 ... 20.937500 39.548892
max 1.565686e+18 14.000000 ... 2072.500000 69.451870
[8 rows x 10 columns]
5.1.0.4 Transforming entire columns
The columns/variables of a dataframe are pandas Series. Each item of a Series is of a specific type. You can inspect the structure of a dataframe using the method “info”.
# open the data as dataframe
df = pd.read_csv('beedata.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171299 entries, 0 to 171298
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 171299 non-null float64
1 hive 171299 non-null int64
2 h 95156 non-null float64
3 t_i_1 155193 non-null float64
4 t_i_2 155203 non-null float64
5 t_i_3 155232 non-null float64
6 t_i_4 155148 non-null float64
7 t_i_5 133220 non-null float64
8 t_o 155142 non-null float64
9 weight_kg 111106 non-null float64
dtypes: float64(9), int64(1)
memory usage: 13.1 MB
# you can call each Series via its header, e.g.
df['weight_kg']
0         NaN
1 NaN
2 10.77900
3 NaN
4 NaN
...
171294 24.31022
171295 NaN
171296 24.31170
171297 24.30817
171298 NaN
Name: weight_kg, Length: 171299, dtype: float64
E.g. converting kg to g:
df['weight_g'] = df['weight_kg']*1000.
df.head()
       time  hive        h   t_i_1  ...    t_i_5     t_o  weight_kg  weight_g
0 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
1 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
2 1.565081e+18 13 59.0554 22.875 ... 22.6875 22.875 10.779 10779.0
3 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
4 1.565081e+18 4 NaN 27.250 ... 23.2500 21.625 NaN NaN
[5 rows x 11 columns]
Marking high temperature values:
df['highTemp'] = df['t_i_1']>25
Data can have various date and time formats. Here, the timestamps are nanoseconds since 1970-01-01. Thus, dealing with the timestamp (nanoseconds to datetime):
df['time'] = pd.to_datetime(df['time'], unit='ns', origin=pd.Timestamp('1970-01-01'))
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171299 entries, 0 to 171298
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 171299 non-null datetime64[ns]
1 hive 171299 non-null int64
2 h 95156 non-null float64
3 t_i_1 155193 non-null float64
4 t_i_2 155203 non-null float64
5 t_i_3 155232 non-null float64
6 t_i_4 155148 non-null float64
7 t_i_5 133220 non-null float64
8 t_o 155142 non-null float64
9 weight_kg 111106 non-null float64
10 weight_g 111106 non-null float64
11 highTemp 171299 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(9), int64(1)
memory usage: 14.5 MB
df.head()
                 time  hive        h  ...  weight_kg  weight_g  highTemp
0 2019-08-06 08:50:18 4 NaN ... NaN NaN True
1 2019-08-06 08:50:28 4 NaN ... NaN NaN True
2 2019-08-06 08:50:31 13 59.0554 ... 10.779 10779.0 False
3 2019-08-06 08:50:38 4 NaN ... NaN NaN True
4 2019-08-06 08:50:48 4 NaN ... NaN NaN True
[5 rows x 12 columns]
Take care, sometimes you don’t get an error, but your data is screwed up anyway:
Use the head function to inspect your data.
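As a sketch of how a wrong separator fails silently in pandas: read_csv raises no error, it simply returns a single column whose name is the whole header line. (The in-memory CSV text below is just for illustration.)

```python
import io
import pandas as pd

csv_text = "name,age\nLi,18\nJacob,25\n"

# correct separator: two columns as expected
ok = pd.read_csv(io.StringIO(csv_text), sep=',')
print(ok.columns.tolist())      # ['name', 'age']

# wrong separator: no error, but everything lands in one column
broken = pd.read_csv(io.StringIO(csv_text), sep=';')
print(broken.columns.tolist())  # ['name,age']
```

So always inspect the parsed columns, not just the absence of an error.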
5.2 Subsetting data
You can use square brackets and boolean expressions to choose rows and columns: df[<expression to choose rows>, <expression to choose columns>]
Select and plot data from hive 4:
df.4 <- df[df$hive==4,]
plot(df.4$time, df.4$t_o, ylim=c(0,40))
Select only the first 1000 lines:
df.4.sub <- df.4[1:1000,]
plot(df.4.sub$time, df.4.sub$t_o, ylim=c(0,40))
Delete columns / choose only some columns:
names(df)
 [1] "time"      "hive"      "h"         "t_i_1"     "t_i_2"     "t_i_3"
 [7] "t_i_4"     "t_i_5"     "t_o"       "weight_kg" "weight_g"  "highTemp"
df.some <- df[, c(1,2,10)]
| time | hive | weight_kg |
|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA |
| 2019-08-06 10:50:28 | 4 | NA |
| 2019-08-06 10:50:31 | 13 | 10.779 |
| 2019-08-06 10:50:38 | 4 | NA |
| 2019-08-06 10:50:48 | 4 | NA |
You can also use:
df.same <- df[, c("time", "hive", "weight_kg")]
| time | hive | weight_kg |
|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA |
| 2019-08-06 10:50:28 | 4 | NA |
| 2019-08-06 10:50:31 | 13 | 10.779 |
| 2019-08-06 10:50:38 | 4 | NA |
| 2019-08-06 10:50:48 | 4 | NA |
or
df.or <- df[, -c(3:9,11)]
| time | hive | weight_kg | highTemp |
|---|---|---|---|
| 2019-08-06 10:50:18 | 4 | NA | TRUE |
| 2019-08-06 10:50:28 | 4 | NA | TRUE |
| 2019-08-06 10:50:31 | 13 | 10.779 | FALSE |
| 2019-08-06 10:50:38 | 4 | NA | TRUE |
| 2019-08-06 10:50:48 | 4 | NA | TRUE |
5.2.0.1 Useful functions
- summary
- head
- tail
- sum
- mean
You can use square brackets and boolean expressions to choose rows and columns.
Select and plot data from hive 4:
df = pd.read_csv('beedata.csv')
df_4 = df[df['hive']==4]
import matplotlib.pyplot as plt
# quick plot with default arguments
plt.plot(df_4['time'],df_4['t_o'])
For more plotting options see https://matplotlib.org/stable/gallery/index.html
fig, ax = plt.subplots()
ax.plot(df_4['time'],df_4['t_o'],'x')
ax.set(ylim=(0, 40))
[(0.0, 40.0)]
plt.show()
Select only the first 1000 lines:
df_4_sub = df_4[1:1000]
# combine head and tail and show the first and last 2 rows
pd.concat([df_4_sub.head(2),df_4_sub.tail(2)])
         time  hive   h    t_i_1  ...    t_i_4    t_i_5      t_o  weight_kg
1 1.565081e+18 4 NaN 27.2500 ... 23.5000 23.2500 21.6250 NaN
3 1.565081e+18 4 NaN 27.2500 ... 23.5625 23.2500 21.6250 NaN
1145 1.565091e+18 4 NaN 31.5625 ... 29.0625 28.8125 27.3750 NaN
1146 1.565091e+18 4 NaN 31.6250 ... 29.1250 28.8125 27.3125 NaN
[4 rows x 10 columns]
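Note that `df_4[1:1000]` slices rows by position here; the explicit positional indexer in pandas is `.iloc`, which also lets you pick columns at the same time. A small sketch on a toy dataframe (the sample values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'hive': [4, 4, 13, 4],
                   't_o': [21.6, 21.6, 22.9, 21.6]})

# first two rows, all columns
first_rows = df.iloc[0:2]
print(first_rows)

# first two rows, first column only
first_col = df.iloc[0:2, 0]
print(first_col)
```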
Delete columns / choose only some columns:
df.columns
Index(['time', 'hive', 'h', 't_i_1', 't_i_2', 't_i_3', 't_i_4', 't_i_5', 't_o',
'weight_kg'],
dtype='object')
df_some = df[["time", "hive", "weight_kg"]]
pd.concat([df_some.head(2),df_some.tail(2)])
          time  hive  weight_kg
0 1.565081e+18 4 NaN
1 1.565081e+18 4 NaN
171297 1.565686e+18 12 24.30817
171298 1.565686e+18 4 NaN
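The R section also shows dropping columns by negative indices; the pandas counterpart is the `drop` method. A sketch on a toy dataframe (the sample columns are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 2], 'hive': [4, 13],
                   'h': [None, 59.0], 'weight_kg': [None, 10.779]})

# drop the 'h' column instead of listing all columns to keep
df_some = df.drop(columns=['h'])
print(df_some.columns.tolist())  # ['time', 'hive', 'weight_kg']
```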
5.2.0.2 Useful functions
- info
- head
- tail
- describe
- columns
5.3 Exercises - Data
## Exercises about data types
# 1. Create following sequences:
# 4, 8, 12, 16, 20, 24, 28
# 10, 8, 6, 4, 2, 0
# repeat "chocolate" 10 times in a vector
# 2. Starting with the following code, divide a by b and save it as variable c:
a <- "5555555"
b <- 3
# 3. Starting with following code, create the string/ character vector "group3trial1.csv"
group <- 3
trial <- 1
## Exercises about data frames
# 1. Read the data "allForarexData.csv"
# 2. Convert Exp0_OxygenTemp and Exp1_OxygenTemp. Divide by 1000. Save the result in the same column.
# 3. What are the means of Exp0_OxygenTemp and Exp1_OxygenTemp?
# 4. Save the timestamp in the correct format in UTC. The time was measured in seconds. (origin="1970-01-01")
# 5. When was the first measurement? When was the last measurement?
# 6. Create a subset containing only data measured after 2019-03-09 09:25:52.
# 7. Difficult: Exclude columns, where all measurements are the same.
# hint at the bottom of the file
## Exercises about both data frames and data types
# Use beedata.csv
# 1. Select the lines of the bee data set where the hive number is 13 or 14.
# 2. What are the classes of the beedata's columns hive and t_i_1?
# 3. Difficult: How can you print the classes of all columns?
# Hint
# hint "Data types", 7: one option is to check if the min value of a column equals the max value. You can use a loop or following function: sapply(flight, function(x){min(x)!=max(x)})## Exercises about data types
# 1. Create following sequences:
# 4, 8, 12, 16, 20, 24, 28
# 10, 8, 6, 4, 2, 0
# repeat "chocolate" 10 times in a vector
seq(4,28,4)
seq(10,0,-2)
rep("chocolate", 10)
# 2. Starting with the following code...:
a <- "5555555"
b <- 3
c <- as.numeric(a) / b
# 3. Starting with following code, create the filename "group3trial1.csv"
group <- 3
trial <- 1
paste("group", group, "trial", trial, ".csv", sep="")
## Exercises about data frames
# 1. Read the data "allForarexData.csv"
flight <- read.table("allForarexData.csv", header = T, sep = ",")
# 2. Convert Exp0_OxygenTemp and Exp1_OxygenTemp. Divide by 1000. Save the result in the same column.
flight$Exp0_OxygenTemp <- flight$Exp0_OxygenTemp/1000
flight$Exp1_OxygenTemp <- flight$Exp1_OxygenTemp/1000
# 3. What are the means of Exp0_OxygenTemp and Exp1_OxygenTemp?
mean(flight$Exp0_OxygenTemp)
mean(flight$Exp1_OxygenTemp)
# 4. Save the timestamp in the correct format in UTC. The time was measured in seconds.
flight$timeStamp <- as.POSIXct(flight$timeStamp, origin="1970-01-01", tz = "UTC")
# 5. When was the first measurement? When was the last measurement?
min(flight$timeStamp)
max(flight$timeStamp)
# 6. Create a subset containing only data measured after 2019-03-09 09:25:52.
flight.subtime <- flight[flight$timeStamp>"2019-03-09 09:25:52",]
# 7. Difficult: Exclude columns, where all measurements are the same.
# hint: one option is using: sapply(flight, function(x){min(x)!=max(x)})
ncol(flight)
flight.sub <- flight[, sapply(flight, function(x){min(x)!=max(x)})]
ncol(flight.sub)
## Exercises about both data frames and data types
# Use beedata.csv
# 1. Select the lines of the bee data set where the hive number is 13 or 14.
df.13or14 <- df[df$hive==13|df$hive==14,]
# 2. What are the classes of the beedata's columns hive and t_i_1?
class(df$hive)
class(df$t_i_1)
# 3. Difficult: How can you print the classes of all columns?
lapply(df, class)

## Exercises about data types
# 1. Create following sequences:
# 4, 8, 12, 16, 20, 24, 28
# 10, 8, 6, 4, 2, 0
# repeat "chocolate" 10 times in a vector
# 2. Starting with the following code, divide a by b and save it as variable c:
a = "5555555"
b = 3
# 3. Starting with following code, create the string/ character vector "group3trial1.csv"
group = 3
trial = 1
## Exercises about data frames
# 1. Read the data "allForarexData.csv"
# 2. Convert Exp0_OxygenTemp and Exp1_OxygenTemp. Divide by 1000. Save the result in the same column.
# 3. What are the means of Exp0_OxygenTemp and Exp1_OxygenTemp?
# 4. Save the timestamp in the correct format. The time was measured in seconds. (origin="1970-01-01")
# 5. When was the first measurement? When was the last measurement?
# if you are unsure whether the data are chronologically sorted
# 6. Create a subset containing only data measured after 2019-03-09 09:25:52.
# 7. Difficult: Exclude columns, where all measurements are the same. Hint: there is a function called nunique
## Exercises about both data frames and data types
# Use beedata.csv
# 1. Select the lines of the bee data set where the hive number is 13 or 14.
# 2. What are the classes of the beedata's columns hive and t_i_1?
# 3. Difficult: How can you print the classes of all columns?

## Exercises about data types
# 1. Create following sequences:
# 4, 8, 12, 16, 20, 24, 28
# 10, 8, 6, 4, 2, 0
# repeat "chocolate" 10 times in a vector
sequence = list(range(4, 29, 4))
print(sequence)
sequence = list(range(10, -1, -2))
print(sequence)
words = ["chocolate"] * 10
print(words)
# 2. Starting with the following code...:
a = "5555555"
b = 3
c = int(a) / b
print(c)
# 3. Starting with following code, create the filename "group3trial1.csv"
group = 3
trial = 1
filename = "group" + str(group) + "trial" + str(trial) + ".csv"
print(filename)
## Exercises about data frames
# 1. Read the data "allForarexData.csv"
# import package
import pandas as pd
# open the data as dataframe
df = pd.read_csv('allForarexData.csv')
# use head function to see the first 5 rows of the dataframe
df.head()
# 2. Convert Exp0_OxygenTemp and Exp1_OxygenTemp. Divide by 1000. Save the result in the same column.
df['Exp0_OxygenTemp'] = df['Exp0_OxygenTemp']/1000.
df['Exp1_OxygenTemp'] = df['Exp1_OxygenTemp']/1000.
# 3. What are the means of Exp0_OxygenTemp and Exp1_OxygenTemp?
print(df['Exp0_OxygenTemp'].mean())
print(df['Exp1_OxygenTemp'].mean())
# 4. Save the timestamp in the correct format. The time was measured in seconds. (origin="1970-01-01")
df['timeStamp'] = pd.to_datetime(df['timeStamp'], unit='s', origin=pd.Timestamp('1970-01-01'))
df.head()
# 5. When was the first measurement? When was the last measurement?
# if you are unsure whether the data are chronologically sorted
df = df.sort_values(by=['timeStamp'], ascending=True)
df.head(1)
df.tail(1)
# 6. Create a subset containing only data measured after 2019-03-09 09:25:52.
df_subtime = df[df['timeStamp']>'2019-03-09 09:25:52']
df_subtime.head()
# 7. Difficult: Exclude columns, where all measurements are the same. hint: there is a function called nunique
constant_columns = [col for col in df.columns if df[col].nunique() == 1]
print("Constant columns:", constant_columns)Constant columns: []
df_sub = df.drop(constant_columns, axis=1)
## Exercises about both data frames and data types
# Use beedata.csv
# 1. Select the lines of the bee data set where the hive number is 13 or 14.
filtered_df = df[(df['hive'] == 13) | (df['hive'] == 14)]
filtered_df.head()
         time  hive         h   t_i_1  ...    t_i_4    t_i_5      t_o  weight_kg
2 1.565081e+18 13 59.05540 22.8750 ... 21.8125 22.6875 22.8750 10.779
10 1.565081e+18 13 59.46195 22.6875 ... 21.6875 22.6250 22.8125 10.776
17 1.565082e+18 13 59.39778 22.5000 ... 21.6250 22.4375 22.6250 10.783
25 1.565082e+18 13 59.23397 22.4375 ... 21.5000 NaN 22.5000 10.781
33 1.565082e+18 13 59.55365 22.3750 ... 21.4375 22.2500 22.5000 10.779
[5 rows x 10 columns]
# 2. What are the classes of the beedata's columns hive and t_i_1?
# 3. Difficult: How can you print the classes of all columns?
# This is not at all difficult in python:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171299 entries, 0 to 171298
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 171299 non-null float64
1 hive 171299 non-null int64
2 h 95156 non-null float64
3 t_i_1 155193 non-null float64
4 t_i_2 155203 non-null float64
5 t_i_3 155232 non-null float64
6 t_i_4 155148 non-null float64
7 t_i_5 133220 non-null float64
8 t_o 155142 non-null float64
9 weight_kg 111106 non-null float64
dtypes: float64(9), int64(1)
memory usage: 13.1 MB
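Besides df.info(), pandas also exposes the column types directly via the dtypes attribute; a minimal sketch on a toy frame (standing in for beedata.csv):

```python
import pandas as pd

# toy frame standing in for beedata.csv
df = pd.DataFrame({"hive": [13, 14], "t_i_1": [22.5, 23.0]})

# dtypes returns a Series mapping column name -> dtype
print(df.dtypes)

# or loop over the columns one by one
for col in df.columns:
    print(col, ":", df[col].dtype)
```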
5.4 Conditions - If and Else
5.4.0.1 If
Simple if:
a <- 19
if(a >= 18){
print("Yes, you are allowed to see this content.")
}[1] "Yes, you are allowed to see this content."
if and else:
goodWeather <- TRUE
if (goodWeather){
print("Let's go to the beach!")
} else{
print("Let's eat pizza.")
}[1] "Let's go to the beach!"
Else-if:
do.you.want.this <- "chocolate"
#do.you.want.this <- "cookies"
#do.you.want.this <- "carrots"
if (do.you.want.this == "chocolate"){
print("Yes, I love chocolate!")
} else if (do.you.want.this == "cookies"){
print("Yes, if they are with chocolate.")
} else {
print("Hm, not sure. Do you have chocolate?")
}[1] "Yes, I love chocolate!"
library(lubridate)
birthday <- as.POSIXct("2019-08-06 10:50:18")
what.do.you.want.to.know <- "year"
#what.do.you.want.to.know <- "month"
#what.do.you.want.to.know <- "chocolate"
if (what.do.you.want.to.know == "year"){
print(year(birthday))
} else if (what.do.you.want.to.know == "month"){
print(month(birthday))
} else {
print("Sorry, what do you want to know?")
}[1] 2019
5.4.0.2 If
Note: if, else, etc. do not use (), {} or [] to mark their beginning and end. Python uses indentation instead.
Simple if:
a = 19
if a >= 18:
print("Yes, you are allowed to see this content.")Yes, you are allowed to see this content.
if and else:
goodWeather = True
if goodWeather:
print("Let's go to the beach!")
else:
print("Let's eat pizza.")Let's go to the beach!
elif (short for else if):
do_you_want_this = "chocolate"
#do_you_want_this = "cookies"
#do_you_want_this = "carrots"
if do_you_want_this == "chocolate":
print("Yes, I love chocolate!")
elif do_you_want_this == "cookies":
print("Yes, if they are with chocolate.")
else:
print("Hm, not sure. Do you have chocolate?")Yes, I love chocolate!
from datetime import datetime
birthday = datetime.fromisoformat('2019-08-06 10:50:18')
what_do_you_want_to_know = "year"
#what_do_you_want_to_know = "month"
#what_do_you_want_to_know = "chocolate"
if what_do_you_want_to_know == "year":
print(birthday.year)
elif what_do_you_want_to_know == "month":
print(birthday.month)
else:
print("Sorry, what do you want to know?")2019
5.5 For and While
5.5.0.1 For
ages <- c(31, 29, 5)
total.age <- 0
for(i in ages){
total.age <- total.age + i
}
total.age[1] 65
myfamily <- list(Paula=31, Ali=29, Emma=5)
age <- 0
for (i in 1:length(myfamily)){
age <- age + myfamily[[i]]
}
age[1] 65
apply it to columns:
for (i in 1:ncol(df)){
name <- names(df)[i]
colclass <- class(df[,i])
print(paste(name, ":", colclass))
}[1] "time : POSIXct" "time : POSIXt"
[1] "hive : integer"
[1] "h : numeric"
[1] "t_i_1 : numeric"
[1] "t_i_2 : numeric"
[1] "t_i_3 : numeric"
[1] "t_i_4 : numeric"
[1] "t_i_5 : numeric"
[1] "t_o : numeric"
[1] "weight_kg : numeric"
[1] "weight_g : numeric"
[1] "highTemp : logical"
transform long code …
df <- read.csv("beedata.csv")
print("mean of h:")
mean(df$h)
print("mean of t_i_1:")
mean(df$t_i_1)
print("mean of weight_kg:")
mean(df$weight_kg)
print("missing h values:")
sum(is.na(df$h))
print("missing t_i_1 values:")
sum(is.na(df$t_i_1))
print("missing weight_kg values:")
sum(is.na(df$weight_kg))
… into shorter code
to.analyse <- c("h", "t_i_1", "weight_kg")
for(i in 1:length(to.analyse)){
print(paste("mean of", to.analyse[i], ":"))
print(mean(df[,to.analyse[i]]))
print(paste("missing", to.analyse[i], "values:"))
print(sum(is.na(df[,to.analyse[i]])))
}
5.5.0.2 While
a <- 2
while (a<1000){
a <- a^2
print(a)
}[1] 4
[1] 16
[1] 256
[1] 65536
5.5.0.3 Do-while
a <- 10
repeat{
a <- a - sqrt(a)
print(a)
if(sqrt(a) > a){
break
}
}[1] 6.837722
[1] 4.222818
[1] 2.167869
[1] 0.6955003
bad example for Do-while:
a <- 10
repeat{
a <- a - sqrt(a)
print(a)
if(a < 0){
break
}
}[1] 6.837722
[1] 4.222818
[1] 2.167869
[1] 0.6955003
[1] -0.1384663
because it breaks for edge cases:
a <- -10
repeat{
a <- a - sqrt(a)
print(a)
if(a < 0){
break
}
}Warning in sqrt(a): NaNs produced
[1] NaN
Error in if (a < 0) {: missing value where TRUE/FALSE needed
5.5.0.4 For vs. While
- if you do not know the number of iterations needed, use while
- else, always use for!
- it is easier to accidentally create a while-loop that never stops than a for-loop that never stops.
5.5.0.5 For
Note: for, while, etc. do not use (), {} or [] to mark their beginning and end. Python uses indentation instead.
ages = [31, 29, 5]
total_age = 0
for i in ages:
total_age = total_age + i
print(total_age)65
same as:
ages = [31, 29, 5]
total_age = 0
for age in ages:
total_age = total_age + age
print(total_age)65
The next example uses a dictionary with {key:value} structure.
myfamily = {'Paula':31, 'Ali':29, 'Emma':5}
age = 0
for key, value in myfamily.items():
age = age + value
print(age)65
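Since the loop only adds up the values, it can also be written with the built-in sum:

```python
myfamily = {'Paula': 31, 'Ali': 29, 'Emma': 5}
# sum() over the dict's values gives the same total as the loop
age = sum(myfamily.values())
print(age)  # 65
```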
apply it to dataframes:
import pandas as pd
df = pd.read_csv('beedata.csv')
for ind,value in df['t_i_1'].items():
if ind > 6 and ind <= 10:
print('ind = ',ind)
print('value = ',value)ind = 7
value = 27.25
ind = 8
value = 27.25
ind = 9
value = 27.25
ind = 10
value = 22.6875
transform long code …
import pandas as pd
df = pd.read_csv('beedata.csv')
print("mean of h:")mean of h:
df['h'].mean()np.float64(56.397063812371265)
print("mean of t_i_1:")mean of t_i_1:
df['t_i_1'].mean()np.float64(25.847728319382963)
print("mean of weight_kg:")mean of weight_kg:
df['weight_kg'].mean()np.float64(28.43513778723021)
print("missing h values:")missing h values:
df['h'].isna().sum()np.int64(76143)
print("missing t_i_1 values:")missing t_i_1 values:
df['t_i_1'].isna().sum()np.int64(16106)
print("missing weight_kg values:")missing weight_kg values:
df['weight_kg'].isna().sum()np.int64(60193)
… into shorter code
to_analyse = ["h", "t_i_1", "weight_kg"]
for i in to_analyse:
print("mean of", i, df[i].mean())
print("sum of missing values of", i, df[i].isna().sum())mean of h 56.397063812371265
sum of missing values of h 76143
mean of t_i_1 25.847728319382963
sum of missing values of t_i_1 16106
mean of weight_kg 28.43513778723021
sum of missing values of weight_kg 60193
5.5.0.6 While
a = 2
while (a<1000):
a = a**2
print(a)4
16
256
65536
5.5.0.7 Do-while
import math
a = 10
while True:
a = a - math.sqrt(a)
print(a)
if math.sqrt(a) > a:
break6.83772233983162
4.222818452528758
2.167868708002445
0.6955003070910275
bad example for Do-while:
import math
a = 10
while True:
a = a - math.sqrt(a)
print(a)
if a < 0:
break6.83772233983162
4.222818452528758
2.167868708002445
0.6955003070910275
-0.13846630320642772
because it breaks for edge cases:
import math
a = -10
while True:
try:
a = a - math.sqrt(a)
except ValueError:
print("math domain error: sqrt(a) with negative a is not defined")
break
print(a)
if a < 0:
breakmath domain error: sqrt(a) with negative a is not defined
5.5.0.8 For vs. While
- if you do not know the number of iterations needed, use while
- else, always use for!
- it is easier to accidentally create a while-loop that never stops than a for-loop that never stops.
5.6 Exercises - Conditions, For and While
# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea print
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".
Ali <- 5555/111
Bea <- 10*5
## For and While
# 1. Loop from 1 to 15 and print the square-root of
# each number
# 2. The function "rbinom(1,1,0.9)" models a very unfair
# coin, which shows the head (1) with a probability
# of 90 %. Flip this coin until you get tail (0).
# Count how many times you flipped the coin.
# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string.
# hint at the bottom of this file
# 4. Difficult: Read in all files in the directory
# "forarexData" and combine them to a single data frame.
# hint at the bottom of this file
# Hints
# Watch out, spoiler alert!
# hint "For and While", 3: Check out the functions "substr" and "nchar".
# hint "For and While", 4: Useful functions: list.files, try or tryCatch, rbind
# hint "Functions" 1: in case you did not find the solution. This is the solution:
# Watch out this is a solution for another task!
s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
for(j in 1:3){
s1 <- paste(s1, substr(s, i, i), sep = "")
}
}
s1
## Exercises - Conditions
# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea print
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".
Ali <- 5555/111
Bea <- 10*5
if (Ali>Bea){
print("Ali is older than Bea.")
} else{
print("Bea is older than Ali.")
}
# 2. The price of a small bar of chocolate is 1 Euro.
# For a medium bar of chocolate you have to pay 3 Euros
# and for a large bar you have to pay 5 Euros.
# Somebody tells you how much money she has.
# Tell her, how much chocolate she can buy
# (Eg. print "You can buy a small bar of chocolate.")
# She does not want to buy more than one bar of chocolate,
# but the largest possible.
money <- 14/3
if (money >= 5){
print("You can buy a large bar of chocolate.")
} else if (money >= 3){
print("You can buy a medium bar of chocolate.")
} else if (money >= 1){
print("You can buy a small bar of chocolate.")
} else{
print("Sorry, you can't buy any chocolate.")
}
## For and While
# 1. Loop from 1 to 15 and print the square-root of
# each number
for (i in 1:15){
print(sqrt(i))
}
# 2. The function "rbinom(1,1,0.9)" models a very unfair
# coin, which shows the head (1) with a probability
# of 90 %. Flip this coin until you get tail (0).
# Count how many times you flipped the coin.
coin <- 1
index <- 0
while (coin!=0){
coin <- rbinom(1,1,0.9)
index <- index +1
}
index
# or
index <- 0
repeat{
coin <- rbinom(1,1,0.9)
index <- index +1
if(coin==0){
break
}
}
index
# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string.
# hint: Check out the functions "substr" and "nchar".
s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
for(j in 1:3){
s1 <- paste(s1, substr(s, i, i), sep = "")
}
}
s1
# 4. Difficult: Read in all files in the directory
# "forarexData" and combine them to a single data frame.
# hint: Useful functions: list.files, try or tryCatch, rbind
dir <- "forarexData/"
files <- list.files(dir)
files <- paste(dir, files, sep="")
df<-data.frame()
for (i in 1:length(files)) {
tmp <- read.csv(file = files[i])
df <- rbind(df,tmp)
}
# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea, print
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".
Ali = 5555/111
Bea = 10*5
# 2. The price of a small bar of chocolate is 1 Euro.
# For a medium bar of chocolate you have to pay 3 Euros
# and for a large bar you have to pay 5 Euros.
# Somebody tells you how much money she has.
# Tell her, how much chocolate she can buy
# (Eg. print "You can buy a small bar of chocolate.")
# She does not want to buy more than one bar of chocolate,
# but the largest possible.
money = 14/3
## For and While
# 1. Loop from 1 to 15 and print the square-root of
# each number
# hint: you need the package numpy for square-roots
import numpy as np
# 2. The function "np.random.binomial(n=1, p=0.9, size=1)" models a very unfair
# coin, which shows the head (1) with a probability
# of 90 %. Flip this coin until you get tail (0).
# Count how many times you flipped the coin.
# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string.
# hint at the bottom of this file
# 4. Difficult: Read in all files in the directory
# "forarexData" and combine them to a single data frame.
# hint at the bottom of this file
# Hints
# Watch out, spoiler alert!
# hint "For and While", 3: The function "len" works also for strings.
# hint "For and While", 4: useful functions: os.listdir, os.path.join(), pd.concat
import os
import pandas as pd
## Exercises - Conditions
# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea, print
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".
Ali = 5555/111
Bea = 10*5
if Ali > Bea:
print("Ali is older than Bea.")
else:
print("Bea is older than Ali.")
# 2. The price of a small bar of chocolate is 1 Euro.
# For a medium bar of chocolate you have to pay 3 Euros
# and for a large bar you have to pay 5 Euros.
# Somebody tells you how much money she has.
# Tell her, how much chocolate she can buy
# (Eg. print "You can buy a small bar of chocolate.")
# She does not want to buy more than one bar of chocolate,
# but the largest possible.
money = 14/3
if money >= 5:
print("You can buy a large bar of chocolate.")
elif money >= 3:
print("You can buy a medium bar of chocolate.")
elif money >= 1:
print("You can buy a small bar of chocolate.")
else:
print("Sorry, you can't buy any chocolate.")
## For and While
# 1. Loop from 1 to 15 and print the square-root of
# each number
# you need the package numpy to compute square-roots
import numpy as np
for i in range(1,16):
print(np.sqrt(i))
# 2. The function "np.random.binomial(n=1, p=0.9, size=1)" models a very unfair
# coin, which shows the head (1) with a probability
# of 90 %. Flip this coin until you get tail (0).
# Count how many times you flipped the coin.
coin = 1
index = 0
while coin!=0:
coin = np.random.binomial(n=1, p=0.9, size=1)
index = index +1
print(index)
# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string.
s = "summer"
s1 = ""
for i in range(len(s)):
for j in range(3):
s1 += s[i]
print(s1)
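A more compact alternative to the nested loops, using string repetition and str.join:

```python
s = "summer"
# each character is repeated 3 times, then everything is joined back together
s1 = "".join(ch * 3 for ch in s)
print(s1)  # sssuuummmmmmeeerrr
```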
# 4. Difficult: Read in all files in the directory
# "forarexData" and combine them to a single data frame.
# hint: Useful functions: os.listdir, os.path.join(), pd.concat
import os
import pandas as pd
dir = "forarexData/"
files = os.listdir(dir)
files = [os.path.join(dir, f) for f in files]
df = pd.DataFrame()
for file in files:
tmp = pd.read_csv(file)
df = pd.concat([df, tmp], ignore_index=True)
print(df)
5.7 Writing own functions
The basic syntax to write functions looks like this:
myPlus <- function(a, b){
return(a + b)
}
myPlus(1,2)[1] 3
You can use default parameters:
myPlus <- function(a=1, b=1){
return(a + b)
}
myPlus(1,2)[1] 3
myPlus()[1] 2
If you want to return multiple values, you can use a named list:
dataframe.info <- function(df){
cells.count <- ncol(df)*nrow(df)
return(list(columns=ncol(df), rows=nrow(df), cells= cells.count))
}
dataframe.info(df)$columns
[1] 12
$rows
[1] 171299
$cells
[1] 2055588
The basic syntax to write functions looks like this:
def myPlus(a, b):
c = a+b
return c
myPlus(1,2)3
You can use default parameters:
def myPlus(a=1, b=1):
c = a+b
return c
myPlus(1,2)3
myPlus()2
Example to return multiple values:
def dataframe_info(df):
rows_count = df.shape[0]
columns_count = df.shape[1]
values_count = df.shape[0]*df.shape[1]
return columns_count, rows_count, values_count
import pandas as pd
df = pd.read_csv('beedata.csv')
dataframe_info(df)(10, 171299, 1712990)
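Returning a dict instead of a plain tuple gives the returned values names, closer to R's named list; a minimal variant:

```python
import pandas as pd

def dataframe_info(df):
    rows, columns = df.shape
    # a dict plays the role of R's named list
    return {"columns": columns, "rows": rows, "cells": rows * columns}

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(dataframe_info(df))  # {'columns': 2, 'rows': 3, 'cells': 6}
```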
5.8 Exercises: Functions
# 1. In one of the previous tasks you wrote some code to
# Repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter
# should be repeated.
# Use “summer” and 3 as default values.
# hint at the bottom of this file
# 2. We look at the data from the beehives again.
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.
# 3. Difficult-follow-up: In a previous exercise, you wrote some code to read
# in all files within a directory.
# Convert the code to a function.
# Hints
# hint "Functions" 1: in case you did not find the solution. This is the solution:
# Watch out this is a solution for another task!
s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
for(j in 1:3){
s1 <- paste(s1, substr(s, i, i), sep = "")
}
}
s1
# 1. In one of the previous tasks you wrote some code to
# Repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter
# should be repeated.
# Use “summer” and 3 as default values.
# hint: in case you did not find the solution. This is the solution:
# Watch out this is a solution for another task!
s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
for(j in 1:3){
s1 <- paste(s1, substr(s, i, i), sep = "")
}
}
s1
funny.words <- function(word="summer", repeat.times=3) {
s1 <- ""
for(i in 1:nchar(word)){
for(j in 1:repeat.times){
s1 <- paste(s1, substr(word, i, i), sep = "")
}
}
return(s1)
}
# 2. We look at the data from the beehives again.
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.
df <- read.csv("beedata.csv")
beehive.info <- function(df, hive){
sub <- df[df$hive==hive,]
n <- nrow(sub)
missing <- sum(is.na(sub$weight_kg))
return(list(n=n, missing=missing))
}
beehive.info(df, 4)
beehive.info(df, 13)
# 3. In a previous exercise, you wrote some code to read
# in all files within a directory.
# Convert the code to a function.
combine <- function(dir="forarexData/", finalName="forarexCombined.csv"){
# Purpose: the function can combine the files and save to a new csv
# Parameters:
# dir: the directory of the files to be combined
# finalName: the name of the file to be used to save the csv
# list all files in directory
files <- list.files(dir)
files <- paste(dir, files, sep="")
df<-data.frame()
for (i in 1:length(files)) {
tmp <- read.csv(file = files[i])
df <- rbind(df,tmp)
}
write.csv(df, finalName)
}
combine()
### Other option with try
combine.try <- function(dir="forarexData/", finalName="forarexCombined.csv"){
# Purpose: the function can combine the files and save to a new csv
# Parameters:
# dir: the directory of the files to be combined
# finalName: the name of the file to be used to save the csv
# list all files in directory
files <- list.files(dir)
files <- paste(dir, files, sep="")
df<-data.frame()
for (i in 1:length(files)) {
try({
tmp <- read.csv(file = files[i])
df <- rbind(df,tmp)
})
}
write.csv(df, finalName)
}
combine.try()
# option with tryCatch, in case some files are completely malformed
combine.trycatch <- function(dir="forarexData/", finalName="forarexCombined.csv"){
# Purpose: the function can combine the files and save to a new csv
# Parameters:
# dir: the directory of the files to be combined
# finalName: the name of the file to be used to save the csv
# list all files in directory
files <- list.files(dir)
files <- paste(dir, files, sep="")
df<-data.frame()
for (i in 1:length(files)) {
tryCatch({
tmp <- read.csv(file = files[i])
df <- rbind(df,tmp)
},
warning = function(warning_condition){
print(paste("Some weird formatting in file", files[i]))
print(warning_condition)
},
# handle errors in case of wrong formatting
error = function(error_condition){
print(paste("Skipping file", files[i]))
print(error_condition)
})
}
write.csv(df, finalName)
}
combine.trycatch()
# 1. In one of the previous tasks you wrote some code to
# repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter
# should be repeated.
# Use “summer” and 3 as default values.
# hint at the bottom of this file
# 2. We look at the data from the beehives again.
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.
# 3. Difficult-follow-up: In a previous exercise, you wrote some code to read
# in all files within a directory.
# Convert the code to a function.
# Hints
# hint "Functions" 1: in case you did not find the solution. This is the solution:
# Watch out this is a solution for another task!
s = "summer"
s1 = ""
for i in range(len(s)):
for j in range(3):
s1 += s[i]
print(s1)sssuuummmmmmeeerrr
# 1. In one of the previous tasks you wrote some code to
# Repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter
# should be repeated.
# Use “summer” and 3 as default values.
# hint: in case you did not find the solution. This is the solution:
# Watch out this is a solution for another task!
s = "summer"
s1 = ""
for i in range(len(s)):
for j in range(3):
s1 += s[i]
print(s1)
def funny_words(word="summer", repeat_times=3):
s1 = ""
for i in range(len(word)):
for j in range(repeat_times):
s1 += word[i]
return s1
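A quick self-contained check (the function is restated in a compact, equivalent form):

```python
def funny_words(word="summer", repeat_times=3):
    # compact equivalent of the loop version above
    s1 = ""
    for ch in word:
        s1 += ch * repeat_times
    return s1

print(funny_words())         # sssuuummmmmmeeerrr
print(funny_words("ab", 2))  # aabb
```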
# 2. We look at the data from the beehives again.
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.
import pandas as pd
df = pd.read_csv("beedata.csv")
def beehive_info(df, hive):
sub = df[df['hive'] == hive]
n = len(sub)
missing = sub['weight_kg'].isna().sum()
return {'n': n, 'missing': missing}
# examples
result_4 = beehive_info(df, 4)
result_13 = beehive_info(df, 13)
print(result_4)
print(result_13)
# 3. In a previous exercise, you wrote some code to read
# in all files within a directory.
# Convert the code to a function.
import os
import pandas as pd
def combine(dir="forarexData/", final_name="forarexCombined.csv"):
# list all files in directory
files = os.listdir(dir)
files = [os.path.join(dir, f) for f in files]
# create a new, empty dataframe
df = pd.DataFrame()
for file in files:
tmp = pd.read_csv(file)
df = pd.concat([df, tmp], ignore_index=True)
df.to_csv(final_name, index=False)
# call function
combine()
5.9 Debugging
Here is a buggy version of the function “funny.words.”
funny.words <- function(s = "summer", count = 3){
s = "summer"
s1 <- ""
for(i in 1:2){
for(i in 1:count){
print(substr(s, i, i))
s1 <- paste(s1, substr(s, i, i), sep = "")
}
}
return(s1)
}
funny.words()[1] "s"
[1] "u"
[1] "m"
[1] "s"
[1] "u"
[1] "m"
[1] "sumsum"
funny.words("banana", 5)[1] "s"
[1] "u"
[1] "m"
[1] "m"
[1] "e"
[1] "s"
[1] "u"
[1] "m"
[1] "m"
[1] "e"
[1] "summesumme"
Since this code consists of a function and a loop, we cannot simply run it line by line to find the mistakes. However, we can create breakpoints by clicking next to the line numbering (left).
Then we can click on “Source” to execute all code up to the first breakpoint.
In the console, there will be a new set of buttons:
- If you click on “Next”, R will execute the next line. (e.g. line 6)
- If you click on the second button, R will step into the current function call, so it will jump into another function. (e.g. into the print function)
- If you click on the third button, R will execute the rest of the current function or loop. (e.g. lines 6 and 7)
- If you click on “Continue”, R will run until it hits the next breakpoint. (e.g. in the next round of the loop or in line 10)
- If you click on “Stop”, R will exit the debug mode.
Here is a buggy version of the function “funny_words”. We use the package ipdb for debugging.
Insert ipdb.set_trace() into the function to set a breakpoint inside the loop. You can then step through the code and print out variables.
# import ipdb
def funny_words(s = "summer", count = 3):
s = "summer"
s1 = ""
for i in range(2):
for i in range(count):
# ipdb.set_trace() # Pause the execution to inspect 's', 'i' and 'i'
print(s, i, i)
s1 =[s, i, i]
return s1
try out:
funny_words()summer 0 0
summer 1 1
summer 2 2
summer 0 0
summer 1 1
summer 2 2
['summer', 2, 2]
and
funny_words("banana", 5)summer 0 0
summer 1 1
summer 2 2
summer 3 3
summer 4 4
summer 0 0
summer 1 1
summer 2 2
summer 3 3
summer 4 4
['summer', 4, 4]
You can use the following debugging commands, e.g. to print variables or to continue the loop until the next breakpoint:
- p <variable>: print the value of a variable
- c: continue to the next breakpoint
- n: move to the next line of code
- q: quit the debugging session
5.10 Piping
Some people complain about all the brackets in the R syntax. Using pipes, you can get rid of some of them. Pipes may also reflect your workflow more naturally than nested function calls.
The pipe operator %>% is re-exported by many packages, but you get the most options from the magrittr package itself (since R 4.1 there is also a native pipe operator, |>).
Here is a simple example
library(magrittr)
x <- 9
# Calculate the square-root of x
sqrt(x)[1] 3
# Calculate it using pipes
x %>% sqrt[1] 3
In case you want to update x, you can use:
x <- 9
# Calculate the square-root of x and update x
x <- sqrt(x)
x[1] 3
# Calculate it using pipes
x <- 9
x %<>% sqrt
x[1] 3
This is an example with two functions and additional arguments:
nrow(subset(df, hive==4))[1] 60193
df %>% subset(hive==4) %>% nrow[1] 60193
Python has no built-in pipe operator. The closest idioms are method chaining (common on pandas DataFrames) and pandas' DataFrame.pipe() method.
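That said, a pipe can be emulated in plain Python by folding a value through a sequence of functions; a minimal sketch using only the standard library (the helper name pipe is an assumption, not a standard API):

```python
from functools import reduce
import math

def pipe(value, *funcs):
    # feed `value` through each function in turn, like x %>% f %>% g in R
    return reduce(lambda acc, f: f(acc), funcs, value)

print(pipe(9, math.sqrt))        # 3.0
print(pipe(16, math.sqrt, int))  # 4
```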
6 PANGAEA - downloading data
7 Data Cleaning
8 Statistics
8.1 First Step: Look at your data (Descriptive Statistics)
- What is your response variable?
- Plot your data
- Plot raw data
- Frequency distribution -> Are there any patterns?
8.2 Definition - Simple Linear Regression
\[ \begin{equation} y_i = \beta_0 + \beta_1x_i + \epsilon_i, \qquad i=1,...,n \end{equation} \] where \(y\) is the dependent variable, \(x\) is the independent variable (also called explanatory variable), \(\beta_0\) is the intercept parameter, \(\beta_1\) is the slope parameter and \(\epsilon\sim N(0,\sigma^2)\) is the error term.
8.3 Sample data
set.seed(123)
x <- seq(0,5,0.1)
y <- x + rnorm(length(x))
plot(x, y)
hist(y)
8.4 Fitting a Linear Model
mod <- lm(y~x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.0116 -0.6110 -0.0912 0.6575 2.1444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05829 0.25565 0.228 0.821
x 0.99216 0.08812 11.259 3.4e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9263 on 49 degrees of freedom
Multiple R-squared: 0.7212, Adjusted R-squared: 0.7155
F-statistic: 126.8 on 1 and 49 DF, p-value: 3.397e-15
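The same simulated regression can be reproduced in Python; a sketch assuming numpy is available (np.polyfit returns the slope first, then the intercept):

```python
import numpy as np

rng = np.random.default_rng(123)
x = np.arange(0, 5.1, 0.1)       # same grid as seq(0, 5, 0.1) in R
y = x + rng.normal(size=len(x))  # true slope 1, intercept 0

# least-squares fit of a degree-1 polynomial, i.e. a simple linear regression
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # estimates near 1 and 0
```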
plot(x,y)
#abline(mod)
library(ggplot2)
ggplot(mapping= aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)`geom_smooth()` using formula = 'y ~ x'
8.5 Overall check for assumptions
plot(mod)
Produces a series of graphs (the first two are the most important).
8.5.1 1st graph: Constancy of variance
Plot of the residuals against the fitted values
→ Should look like the sky at night
8.5.2 2nd graph: Normality of errors
Normal QQ plot = plot of the ordered residuals against the theoretical quantiles of a normal distribution → Should be a reasonably straight line
8.6 Linear Model - Assumptions
- Linear relationship between x and y
- Independence of residuals
- Constant variance of residuals
- Normality: Residuals are normally distributed
8.7 Assumption 1 - Linear Relationship
- use a scatterplot to check
- if this assumption is violated we have unrealistic predictions and estimates of the standard deviation
- How to deal with this? You can transform the variables or alter the model.
plot(x,y)
8.8 Assumption 2 - Independence of Residuals
- Plot residuals in time/ observation order / based on observation location / …
- You expect that they are equally scattered around 0
- You can also use the Durbin-Watson test
- Autocorrelation can have serious effects on the model
- You can add an AR correlation term to the model. Sometimes it also helps to add a further covariate.
res <- residuals(mod)
plot(x, res)
8.9 Durbin-Watson Test
H0: There is no correlation among the residuals.
HA: The residuals are autocorrelated.
library(car)Loading required package: carData
durbinWatsonTest(mod) lag Autocorrelation D-W Statistic p-value
1 0.02431999 1.940949 0.712
Alternative hypothesis: rho != 0
- The p-value is larger than 0.05, so we cannot reject the null hypothesis that there is no correlation between residuals.
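The Durbin-Watson statistic itself is easy to compute by hand: the sum of squared successive residual differences divided by the sum of squared residuals. A sketch with numpy (values near 2 suggest no autocorrelation):

```python
import numpy as np

def durbin_watson(res):
    res = np.asarray(res, dtype=float)
    # sum of squared successive differences over sum of squared residuals
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

# a perfectly alternating series -> strong negative autocorrelation
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
```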
8.10 Assumption 3 - Constant variance of errors
- Plot fitted values vs residuals
- You expect that they are equally scattered around 0
plot(fitted(mod),res)
8.11 Assumption 4 - Normality of Residuals
- Use qq-plot to compare quantiles.
- Or use the Shapiro-Wilk-Test
- If you do not find a Normal Distribution, check for outliers or transform your variable.
qqPlot(res)[1] 44 18
8.12 Shapiro-Wilk-Test for Normality
- H0: The data is normally distributed
- HA: The data comes from another distribution
shapiro.test(res)
Shapiro-Wilk normality test
data: res
W = 0.99152, p-value = 0.9732
- We cannot reject the null-hypothesis that the data is normally distributed.
8.13 If model doesn’t seem to fit your data:
8.13.1 Is your response really continuous?
If not, another model might be more appropriate:
- Continuous: linear model → LM, e.g. height, weight
- Count: Poisson model → GLM, e.g. number of individuals
- Binary: binomial model → GLM, e.g. presence / absence
8.13.2 Are there obvious patterns?
For example many zeros: you could split this into two analyses, a binomial model for success / failure and a linear model for the successes.
8.13.3 Are there outliers that should be omitted?
4th graph of plot(model): plot of Cook's distances for each observation → Highlights the identity of particularly influential data points
8.13.4 Relationship not a straight line?
Polynomial regression e.g. a quadratic function might be an option
lm(y ~ x + I(x^2))
8.13.5 Transformation
You could for example use a log transformation of the response variable
lm(log(y) ~ x)
8.13.6 Are there other explanatory variables that you should include?
This is shown in the penguin example below (include species as a categorical variable).
8.14 Definition - Linear Model with multiple predictors
\[\begin{equation}\label{eqn:linearregression} y_i = \beta_0 + \beta_1x_{1,i} + ... + \beta_px_{p,i} + \epsilon_i, \qquad i=1,...,n \end{equation}\] where \(y\) is the dependent variable, \(x_1 ... x_p\) are the independent variables (also called explanatory variables), \(\beta_0 ... \beta_p\) are the regression coefficients, \(\epsilon\sim N(0,\sigma^2)\) is the error term and \(p \geq 1\).
8.15 The penguin data set
Artwork by @allison_horst
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
https://allisonhorst.github.io/palmerpenguins/
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
head(penguins)# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
https://github.com/mcnakhaee/palmerpenguins?tab=readme-ov-file
from palmerpenguins import load_penguins
penguins = load_penguins()
penguins.head() species island bill_length_mm ... body_mass_g sex year
0 Adelie Torgersen 39.1 ... 3750.0 male 2007
1 Adelie Torgersen 39.5 ... 3800.0 female 2007
2 Adelie Torgersen 40.3 ... 3250.0 female 2007
3 Adelie Torgersen NaN ... NaN NaN 2007
4 Adelie Torgersen 36.7 ... 3450.0 female 2007
[5 rows x 8 columns]
8.16 Exercise
Using the penguin data set, fit a linear model to determine the relationship between body mass and flipper length.
The flipper length differs across penguin species. How could you add this to your model?
8.16.1 Solution
Remove missing values:
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:car':
recode
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
penguins_filtered <- penguins %>%
  filter(complete.cases(.))
Fit a linear model:
y <- penguins_filtered$body_mass_g
x <- penguins_filtered$flipper_length_mm
mod <- lm(y~x)
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1057.33 -259.79 -12.24 242.97 1293.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5872.09 310.29 -18.93 <2e-16 ***
x 50.15 1.54 32.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393.3 on 331 degrees of freedom
Multiple R-squared: 0.7621, Adjusted R-squared: 0.7614
F-statistic: 1060 on 1 and 331 DF, p-value: < 2.2e-16
Check linear relationship:
plot(x,y)
Check independence of residuals:
res <- residuals(mod)
plot(x, res)
library(car)
durbinWatsonTest(mod)
 lag Autocorrelation D-W Statistic p-value
1 -0.06161287 2.115873 0.358
Alternative hypothesis: rho != 0
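The Durbin-Watson statistic reported above can also be computed by hand: it is the sum of squared successive differences of the residuals divided by their sum of squares, and values near 2 indicate no first-order autocorrelation. A minimal NumPy sketch:

```python
import numpy as np

def durbin_watson(res):
    """Durbin-Watson statistic: sum of squared successive
    differences of the residuals over their sum of squares."""
    res = np.asarray(res, dtype=float)
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

# independent (uncorrelated) residuals should give a value near 2
rng = np.random.default_rng(0)
print(durbin_watson(rng.normal(size=1000)))
```

Strong positive autocorrelation pushes the statistic toward 0, strong negative autocorrelation toward 4; `durbinWatsonTest()` in the car package additionally bootstraps a p-value.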
Check constant variance of errors:
plot(fitted(mod), res)
Check normality of residuals:
qqPlot(res)
[1]  35 164
hist(res)
shapiro.test(res)
Shapiro-Wilk normality test
data: res
W = 0.99237, p-value = 0.08633
Fit a new model to consider penguin species as well.
mod_add <- lm(penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm + penguins_filtered$species)
mod_spec <- lm(penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm * penguins_filtered$species)
summary(mod_spec)
Call:
lm(formula = penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm *
penguins_filtered$species)
Residuals:
Min 1Q Median 3Q Max
-900.90 -247.01 -25.53 197.19 1143.33
Coefficients:
Estimate
(Intercept) -2508.088
penguins_filtered$flipper_length_mm 32.689
penguins_filtered$speciesChinstrap -529.108
penguins_filtered$speciesGentoo -4166.117
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 1.884
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 21.477
Std. Error
(Intercept) 892.413
penguins_filtered$flipper_length_mm 4.692
penguins_filtered$speciesChinstrap 1525.110
penguins_filtered$speciesGentoo 1431.581
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 7.864
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 6.967
t value
(Intercept) -2.810
penguins_filtered$flipper_length_mm 6.967
penguins_filtered$speciesChinstrap -0.347
penguins_filtered$speciesGentoo -2.910
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 0.240
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 3.083
Pr(>|t|)
(Intercept) 0.00525
penguins_filtered$flipper_length_mm 1.78e-11
penguins_filtered$speciesChinstrap 0.72887
penguins_filtered$speciesGentoo 0.00386
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap 0.81077
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo 0.00223
(Intercept) **
penguins_filtered$flipper_length_mm ***
penguins_filtered$speciesChinstrap
penguins_filtered$speciesGentoo **
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 368.4 on 327 degrees of freedom
Multiple R-squared: 0.7938, Adjusted R-squared: 0.7906
F-statistic: 251.7 on 5 and 327 DF, p-value: < 2.2e-16
anova(mod_add, mod_spec)
Analysis of Variance Table
Model 1: penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm +
penguins_filtered$species
Model 2: penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm *
penguins_filtered$species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 329 45843144
2 327 44391669 2 1451475 5.346 0.005193 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding the interaction between flipper length and species significantly improves the model fit (F = 5.35, p ≈ 0.005).
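The anova() comparison above is a nested-model F-test: F = ((RSS1 − RSS2)/(df1 − df2)) / (RSS2/df2), where model 2 contains all terms of model 1 plus the interaction. A sketch of the same test in Python on simulated data; the variable names and true coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
group = rng.integers(0, 2, size=n)      # binary "species"-like factor
# data generated WITH an interaction, so the larger model should win
y = 1.0 + 2.0 * x + 0.5 * group + 1.5 * x * group + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of the least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

ones = np.ones(n)
X_add = np.column_stack([ones, x, group])              # additive model
X_int = np.column_stack([ones, x, group, x * group])   # with interaction

rss1, df1 = rss(X_add, y), n - X_add.shape[1]
rss2, df2 = rss(X_int, y), n - X_int.shape[1]

# F-statistic for the nested comparison, as in anova(mod_add, mod_spec)
F = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
print(F)
```

A large F (compared against an F distribution with df1 − df2 and df2 degrees of freedom) means the extra interaction term explains real structure, just as in the penguin ANOVA table above.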
9 Plotting
9.1 Standard library
# subset data
df.4 <- df[df$hive==4,]
# plot temperature outside
plot(df.4$time, df.4$t_o, ylim=c(0,40),type = 'p', pch=4)
# choose colours
cl <- rainbow(5)
# choose colums
cols <- 4:8
# plot each column
for (i in 1:5){
lines(df.4$time, df.4[,cols[i]],col = cl[i],type = 'p', pch=4, ylim=c(0,40))
}
# add legend
legend("topright", legend=c(1, 2, 3, 4, 5, "outside"),
col=c(cl, "black"), pch = 4, lty = 0, cex=0.8)- “p”: Points
- “l”: Lines
- “b”: Both
Source: http://www.sthda.com/english/wiki/line-types-in-r-lty ### Point types (pch)
Source: http://www.sthda.com/english/wiki/r-plot-pch-symbols-the-different-point-shapes-available-in-r
# subset beedata.csv
df_4 = df[df['hive']==4]
# plot temperature outside
ax = df_4.plot(x='time', y='t_o', kind='scatter', marker='x')
ax.set_ylim(0, 40)
Pandas’ built-in plotting is rather limited. To reproduce the plot shown for R, you need to use Matplotlib.
9.2 R: ggplot and Python: Matplotlib
You can create similar plots in R and Python. Widely used packages are ggplot2 for R and Matplotlib for Python. However, the syntax sometimes differs greatly, so a one-to-one comparison of plotting routines is of limited use.
# subset data
df.4 <- df[df$hive==4,]
# choose columns
df.4.cols <- df.4[,c(1,4:9)]
# reshape data
library(reshape)
Attaching package: 'reshape'
The following object is masked from 'package:dplyr':
rename
The following object is masked from 'package:lubridate':
stamp
mdf <- melt(df.4.cols, id=c("time"))
# plot data
library(ggplot2)
ggplot(data = mdf, aes(x=time, y=value)) + geom_line(aes(colour=variable)) + ylim(c(0, 40))
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# convert time from ns to datetime format for the axis label
df['time'] = pd.to_datetime(df['time'], unit='ns', origin=pd.Timestamp('1970-01-01'))
# Subset data
df_4 = df[df['hive'] == 4]
# colors similar to rainbow(5) in R
colors = plt.cm.rainbow(np.linspace(0, 1, 5))
# select columns
cols = df_4.columns[3:8]
plt.figure(figsize=(10, 6))
# plot temperature outside
plt.scatter(df_4['time'], df_4['t_o'], color='black', marker='x', label='outside')
# plot all other temperatures
for i, col in enumerate(cols):
plt.scatter(df_4['time'], df_4[col], color=colors[i], marker='x', label=str(i+1))
plt.ylim(0, 40)
plt.legend(loc='upper right', fontsize=8)
plt.xlabel('time')
plt.ylabel('Temperature')
plt.show()
9.3 Exercises - Plotting and more
## Plotting
# 1. Choose a set of data from your background.
#    Pick one of the presented plotting libraries.
#    Create a plot and add a title and custom x and y labels.
#    If you plotted multiple lines, add a legend.
#    Difficult: find out how to change the font size of the axis labels.
# 2. Automatically save your plot to a given destination.
# 3. Try to make your code more abstract and reusable.
#    Put your code within a function:
#    - use the file name (destination) as a parameter
#    - if you have multiple covariates, use the covariate(s)
#      to be plotted as an input parameter
#    - if it makes sense to subset your data, introduce
#      a parameter for the choice of the subset.
# Always keep in mind: if something goes wrong, you can use debugging.
## Exercises - Piping
# 1. Rewrite the following code using %>% and %<>%:
x <- 2
round(log(x))
# 2. Rewrite the second line of the following code:
x <- rnorm(10,100)
round(sum(sqrt(x)), 3)
## Loops and functions again
# If you have some code you wrote in the past
# for some exercise, for your studies, for your work...
# See if you can find pieces of code which you
# just copied and pasted multiple times (don't worry, this is normal)
# Try to transform the code using loops and functions.
## Solutions - Piping
# 1. Rewrite the following code using %>% and %<>%:
x <- 2
round(log(x))
x <- 2
x %<>% log %>% round
# 2. Rewrite the second line of the following code:
x <- rnorm(10,100)
round(sum(sqrt(x)), 3)
x %>% sqrt %>% sum %>% round(3)
## Exercises - Piping (Python)
=> There is no piping operator built into Python!
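Python has no built-in %>% operator, but the same left-to-right style can be emulated with a small helper. A sketch (pipe is our own name here, not a standard library function):

```python
from functools import reduce
import math

def pipe(value, *funcs):
    """Thread `value` through `funcs` left to right,
    mimicking R's x %>% f %>% g."""
    return reduce(lambda acc, f: f(acc), funcs, value)

# R: x <- 2; x %>% log %>% round
print(pipe(2, math.log, round))   # round(log(2)) -> 1
```

In everyday pandas code, method chaining (df.dropna().groupby(...).mean()) plays a similar role to the pipe.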
10 Markdown and R-Markdown
Markdown is a great way to document your code and to write reports about your results. Here, we describe R-Markdown in detail. However, you can also use Python: Jupyter notebooks are especially easy to work with, since individual cells can be switched between code and Markdown formatting. See this documentation for details.
10.1 Output formats
The three most common output options for R-Markdown are HTML, PDF and Word documents. If you want to create PDF documents, a TeX distribution needs to be installed.
Depending on your choice, your document will have a different output parameter in the header:
# html
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: html_document
---
# word
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: word_document
---
There are many formatting options which you can specify in the header. These options might be useful for HTML:
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output:
html_document:
toc: true # print a table of content
toc_depth: 2 # maximal depth of the table of content
number_sections: true # automatically number the sections
code_folding: hide # have buttons to show/ hide code
theme: united # specify a theme
toc_float: true # generate a navigation bar
---
10.2 Document structure
If you create a new R-Markdown document, R will automatically create a sample document. You can see that “#” can be used to create headers:
# Header 1
## Header 2
### Header 3
10.3 Code blocks
For code blocks (chunks) you have many options, too. The two most important are:
- eval: if set to true, the code in the chunk will be executed
- echo: if set to true, the source code of the chunk will be shown in the output document
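These options go into the chunk header of the .Rmd source. For example, a chunk whose code runs but whose source is hidden from the reader (the triple backticks that enclose the chunk in the .Rmd file are omitted here; the chunk body is only illustrative):

```
{r, eval=TRUE, echo=FALSE}
summary(penguins)
```

eval and echo are both knitr chunk options; eval=FALSE would instead show the code without running it.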
10.4 Lists
* unordered list
+ sub-item 1
+ sub-item 2
- sub-sub-item 1
* item 2
1. ordered list
2. item 2
    i) sub-item 1
10.5 Insert an image
10.6 Further documentation
A good documentation of R-Markdown can be found here: https://bookdown.org/yihui/rmarkdown/
And here is a cheat sheet: https://www.rstudio.com/resources/cheatsheets/#rmarkdown
There is a lot more you can do with R-Markdown, see: https://rmarkdown.rstudio.com/gallery.html
11 Assignment
- you all get a confirmation of participation
- you can submit a small assignment for 1 CP
11.1 Assignment
- you can choose if you want to use R or Python
- Create a markdown document
- Choose a dataset from PANGAEA
- Think about the following question: What do you want to find out? What is the motivation for your analysis?
- Choose 5 of the following tasks. You can also come up with your own ideas.
- Write a short text for each task explaining what you have done
11.2 Ideas
- Make some exploratory analysis:
- print the mean/ median/ standard deviation for each column
- print the total number of missing values
- print the number of missing values for each column
- Transform some of the columns/ Create new columns based on existing ones
- Create a subset of your data
- Exclude all rows with missing values
- Create a plot
- Write a function that takes a filename and some additional parameters. The function should create a plot and save it as “png” with that filename
- Create a linear model
11.3 Send us an e-mail
If you want to do an assignment, please send us an e-mail:
“I want to do an assignment.”
“I do not want a grade on my certificate. / I want to have a grade on my certificate./ I want to have a grade if it is better than <1.3, 1.7, 2.0, 2.3, 2.7, 3.0, …>”